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Abstract 

We study the problem of similarity detection by sequence alignment with gaps, using 
a recently established theoretical framework based on the morphology of alignment 
paths. Alignments of sequences without mutual correlations are found to have scale- 
invariant statistics. This is the basis for a scaling theory of alignments of correlated 
sequences. Using a simple Markov model of evolution, we generate sequences with well- 
defined mutual correlations and quantify the fidelity of an alignment in an unambiguous 
way. The scaling theory predicts the dependence of the fidelity on the alignment 
parameters and on the statistical evolution parameters characterizing the sequence 
correlations. Specific criteria for the optimal choice of alignment parameters emerge 
from this theory. The results are verified by extensive numerical simulations. 
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1 Introduction 



Sequence alignment has been one of the most valuable computational tools in molecular 
biology. It has been used extensively in discovering and understanding functional and evo- 
lutionary relationships among genes and proteins. There are two basic types of alignment 
algorithms: algorithms without gaps such as BLAST and FASTA (Altschul et al, 1990), and 
algorithms with gaps, for example, the Smith- Waterman local alignment algorithm (Smith 
and Waterman, 1981). Gapless alignment is widely used in database searches because the 
algorithms are fast (Altschul et al, 1990) (computational time scales linearly with sequence 
length), the results depend only weakly on the choice of scoring systems (Altschul, 1993), and 
the statistical significance of the results is well-characterized (Arratia, et al., 1988; Karlin 
and Altschul, 1990; Karlin and Altschul, 1993). However, gapless alignment is not sensitive 
to weak sequence similarities (Pearson, 1991). For a detailed analysis, algorithms with gaps 
are therefore needed (Waterman, 1989; 1994). 

At present, there are two main obstacles to the wider application of these more pow- 
erful tools. They require substantially longer computational time than gapless alignments 
(depending quadratically on the sequence length). As computational power continues its 
exponential growth at a rate even faster than the growth of genomic information, we expect 
this constraint to become less stringent in the near future. More importantly, gapped align- 
ments lack a detailed statistical theory assessing the significance of the results. It is this 
second problem we address in the present paper. 

The common algorithms assign a score to each alignment of two or more sequences. 
The score is based on the number of matches, mismatches, and gaps. Maximization of 
this score is then used to select the optimal alignment, taken as a measure of the mutual 
correlations between the sequences. However, it is well known that the optimal alignment 
of a given pair of sequences strongly depends on the scoring parameters used. The same is 
true for the fidelity of the optimal alignment, that is, the extent to which mutual correlations 
are recovered. Hence, the key problem of alignment statistics is to quantify the degree of 
sequence similarity based on alignment data and to find the scoring parameters producing 
alignments of the highest fidelity. This problem has been addressed for gapless alignments 
(Altschul, 1993), based on the knowledge of the exact probability distribution function for 
the optimal scores in gapless alignments of mutually uncorrelated sequences (Karlin and 
Altschul, 1990). For algorithms with gaps, however, not even the leading moments of the 
distribution function have been known so far. Scoring parameters have been chosen mostly 
by trial and error, although there have been systematic efforts to establish a more solid 
empirical footing (Benner, 1993; Vingron and Waterman, 1994; Koretke et al, 1996). 

To guide the choice of scoring parameters, a quantitative measure of the fidelity of an 
alignment is necessary. Since the algorithm is designed to detect residual similarities between 
sequences in a divergent evolution, it is clear that the fidelity measure has to emerge from 
the underlying evolution process. We use a simple probabilistic evolution model to generate 
daughter sequences from ancestor sequences by local substitutions, insertions, and deletions. 
The model is certainly too simple to describe realistic evolution processes, but it allows an 
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unambiguous identification of inherited mutual similarities between sequences. The fidelity 
of an alignment is then simply the fraction of the inherited similarities recovered by it. 
Maximization of the fidelity is used as a criterion to select optimal scoring parameters. 

We will not address here algorithmic and computational aspects of alignments. Efficient 
algorithms are available for parametric and ensemble alignment (Waterman et al, 1992, 
Gusfield et al, 1992, Waterman, 1994). Our goal is to present a statistical theory of gapped 
alignments. This theory can then be used to predict optimal scoring parameters appropriate 
for different classes of inter-sequence correlations. 

As is well recognized, the main mathematical difficulty preventing a quantitative statis- 
tical characterization of gapped local alignment lies within the global alignment regime. In 
two recent communications (Hwa and Lassig, 1998; Drasdo et al. 1998), we have shown that 
the statistical properties in the parameter regime close to the log-linear phase transition line 
(Waterman et al, 1987; Arratia and Waterman, 1994) of the Smith- Waterman local align- 
ment algorithm are in fact dominated by the statistics of global alignment. This regime is 
important for biological applications since it has been found empirically to produce "good" 
alignments (Vingron and Waterman, 1994). It is thus very important to characterize the 
statistics of global alignment, which is the purpose of this paper. We report a detailed study 
of the properties of global alignments of mutually uncorrelated as well as correlated sequences 
by the Needleman-Wunsch (1970) algorithm. The results are used to select optimal scoring 
parameter to detect sequence correlations generated by the toy evolution process. They can 
also be incorporated directly into the parameter selection procedure for local alignment (Hwa 
and Lassig (1998); Drasdo et a/., 1998). 

The statistical theory of gapped alignments presented here is based on a geometrical ap- 
proach introduced recently by two of us (Hwa and Lassig, 1996). This approach focuses on 
the morphology of the optimal alignment paths. The notion of an alignment path (recalled 
below) provides a very fruitful link to various well-studied problems of statistical mechanics 
(Kardar, 1987; Fisher and Huse, 1991; Hwa and Fisher, 1994) as has also been noticed by 
Zhang and Marr (1995). The important statistical properties of alignment paths are de- 
scribed by a number of scaling laws (Hwa and Lassig, 1996; Drasdo et al, 1997) explained 
in detail below. Their applicability to alignment algorithms is supported by extensive nu- 
merical evidence. The resulting scaling theory of alignment has three main virtues: 

(i) It distinguishes clearly between universal (parameter-independent) properties of align- 
ments and those depending on the scoring parameters (and hence governing their optimal 
choice) . 

(ii) It relates score data of alignments to their fidelity and to the underlying evolutionary 
parameters characterizing the similarities of the sequences compared. 

(iii) Its key statistical averages turn out to be significant for the alignment of single sequence 
pairs that are sufficiently long. 

Statistical scaling theories have also been developed for related optimization problems in 
structural biology, notably protein folding (Wang et al, 1996; Onuchic et al, 1997). 

This paper is organized as follows. In Section 2, we define the evolution process, recall the 
global alignment algorithm used throughout this paper, and discuss the qualitative aspects 
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of the geometrical approach. The quantitative theory of alignment starts in Section 3, where 
we give a detailed description of the alignment statistics for uncorrelated random sequences, 
and present the power laws governing alignment paths and scores. In Section 4, we turn 
to sequences with mutual correlations inherited by a realization of our evolution process. 
We establish a scaling theory that explains the parameter dependence of alignments in 
a quantitative way. Hence we derive optimal alignment parameters as a function of the 
evolution parameters, i.e., the frequency of indels and substitutions^. Furthermore, we show 
how the evolutionary parameters and the optimal alignment of a given pair of sequences can 
be deduced from its score data. 



2 The geometrical approach to sequence alignment 

Evolution model 

The evolution process used in this paper evolves from an "ancestor" sequence Q of length 
N ^> 1 whose elements are labeled by the index i. The element Qi is chosen from a set 
of c different letters. Each letter occurs with equal probability 1/c, independently of the 
elements at other positions. Hence, the ancestor sequence is a Markov random sequence. 
The numerical results presented below are for the case c = 4 as appropriate for nucleotide 
sequences, but for some derivations, it is useful to consider general c-letter alphabets. 

The evolution process generates a daughter sequence Q' of length N' from the ancestor 
sequence Q. It involves local insertions and deletions of random elements with the same 
probability p, and point substitutions by a random element with probability p. Insertion, 
deletion, and substitution events at one point of the sequence are independent of the events 
at other points. The evolution process can thus be formulated as a Markov process along 
the sequence (Bishop and Thompson, 1986, Thorne et al, 1991; Hwa and Lassig, 1996). 
The precise evolution rules used in this paper are given in Appendix A. These rules are such 
that the average length of the daughter sequence, N, equals the length iV of the ancestor 
sequence. 

A specific realization of this Markov process defines a unique evolution path linking the 
sequences Q and Q'; see Fig. 1(a). However, the same pair of sequences can be linked by 
different evolution paths. Any evolution path has a number of conserved elements (i.e., 
elements that are not deleted or substituted at any point of the evolution process). The 
average fraction of ancestor elements Qi conserved in the daughter sequence Q' is 

U(p,q) = (l-p)(l-q) , (1) 

where 



1 A conceptually similar link between scoring parameters and evolution parameters has been discussed in 
the context of maximum-likelihood methods (Bishop and Thompson, 1986; Thorne et al, 1991, 1992). 
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Q:G TACTGATG 

♦ + + 

Q: G A G T A T CTG 

(a) 

Q: GTACTGA TG 
Q:G AGT ATCTG 

(b) 



Fig. 1: (a) A Markov evolution path linking the two sequences Q = {G, T, A, C, T, G, A, T, G} and 
Q' = {G, A, G, T, A, T, C, T, G}. Native pairs are marked by bonds with full circles, substitutions by 
bonds with empty circles. The unpaired letters Qi are deleted, the unpaired letters Q'j are inserted, 
(b) A possible alignment between Q and Q' with matches (Qi = Q'j) (full lines), mismatches 
(Qi 7^ Qj) (dashed lines) and gaps (unpaired letters), (c) Lattice representation. The evolution 
path R(t) corresponding to (a) is marked by circles; there are five native bonds (full circles). The 
alignment path corresponding to (b) appears as thick line whose solid (dashed) diagonal bonds are 
matches (mismatches) and whose horizontal and vertical bonds are gaps. It covers three of the five 
native bonds, producing the fidelity T = 3/5. 

is the effective insertion/deletion rate (see Appendix A). We call these conserved pairs of 
elements (Qi = Q'j) native pairs. Their fraction U(p,q) quantifies the mutual similarity 
between sequences. In the remainder of this paper, we take U and q as the basic parameters 
characterizing the evolution process. The primary goals of sequence alignment are to identify 
the native pairs and to estimate the mutual similarity U. 

Alignment and Scoring Scheme 

We align the sequences Q = {Qi} and Q' = {Q'j} using the simplest version of the global 
alignment algorithm by Needleman and Wunsch (1970). A global alignment of two sequences 
is defined as an ordered set of pairings (Qi, Q'j) and of gaps (Qi, — ) and (— , Q'j), each element 
Qi and Q'j belonging to exactly one pairing or gap (see Fig. 1(b)). Any alignment is assigned 
a score S, maximization of which defines the optimal alignment^. We use here the simplest 
class of linear scoring functions (Smith and Waterman, 1981), with the score given by the 
total number N + of matches (Qi = Q'j), the total number N_ of mismatches (Qi ^ Q'j), and 
the total number N g of gaps. Hence, the most general such function involves three scoring 
parameters: 

S = fl + N + + /i_AL + figNg . (3) 

However, the optimal alignment configuration of a given sequence pair Q and Q' is left 
invariant if the three scoring parameters are all multiplied by the same factor. Along with 

2 In statistical mechanics, one may think of — S as an energy that has to be minimized. 
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the property that 2N + + 2iV_ + N g = N is conserved in global alignment and the invariance 
of the alignment configuation to an additive constant to (|3|), we see that the outcome of 
global alignment is controlled effectively by a single parameter.. Without loss of generality^, 
we may therefore choose to use the scoring function 



S = s/c~^\N + --^=N^- 1 N g , (4) 

y/c- 1 

which is normalized in such a way that a pairing of two independent random elements has 
the average score and the score variance 1. The scoring function S depends only on the 
parameter 7, which describes the effective cost of a gap over pairing. The optimal alignment 
depends on 7 in the regime 7 > 70 = l/(2\/c — 1) (i.e., 2/i 9 > to which we restrict 
ourselves in the sequel. For 7 < 7o , it is always favorable to replace a mismatch by two gaps 
(Waterman et ai, 1987), and the alignment is not biological relevant. 

The fidelity of an alignment 

As discussed above, mutual correlations between the sequences Q = {Qi} and Q' = {Qj} 
arise from the set of native pairs (Qi = Q'j). The fidelity T of an alignment can be quantified 
as the fraction of correctly matched native pairs, see Fig. 1(b). This is an unambiguous 
measure of the goodness of an alignment, and it will be used below to find optimal alignment 
parameters. To evaluate T directly, the native pairs have to be distinguished from random 
matches (Qi = Qj) involving mutated elements. Hence, the fidelity defined in this way 
depends not only on the sequences Q and Q' but also on the evolution path linking them. 
Of course, the evolution path is not known in most applications of sequence alignment. 
However, the scaling theory discussed below relates statistical properties of T to alignment 
data, making it a useful and measurable quantity. 

Lattice representation 

Any alignment of two sequences {Qi} and {Qj} is conveniently represented on a two- 
dimensional N x N' grid as in Fig. 1(c) (Needleman and Wunsch, 1970). The cells of 
this grid are labeled by the index pair (i,j). The diagonal bond in cell (i,j) represents the 
pairing of the elements (Qi, Q'j). The horizontal bond between cells (i,j) and (i,j + 1) repre- 
sents a gap (Qi, — ) located on sequence Q' between the elements Q'j and Q'j +V The vertical 
bond between cells (i,j) and (i + 1, j) represents a gap located on sequence Q between the 
elements Qi and Qi+\. In this way, any alignment defines a unique directed path on the 
grid. Using the rotated coordinates r = j — i and t = i + j, this path is described by a 
single- valued function r(t) measuring the displacement of the path from the diagonal of the 

3 Indeed, the scoring functions (||) and (|J) lead to the same optimal alignment if 

C — 2 C /!+ + /!_ — 2[lg 
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alignment grid. A path associated with an optimal alignment is denoted by r (t). For global 
alignment of typical sequences, the optimal path extends over the entire grid, i.e., it has a 
length of the order N + N' = 2N. The Needleman-Wunsch dynamic programming algorithm 
obtains optimal alignment paths by computing the "score landscape" S (r,t) sequentially 
for all lattice points, where S (r,t) denotes the optimal score for the set of all alignment 
paths ending at the point (r,t). The version of the algorithm used in this paper is detailed 
in Appendix B. 

In a similar way, any evolution path linking the sequences Q and Q' defines a directed 
path R(t) on the alignment grid (called evolution path as well) (Hwa and Lassig, 1996). On 
this path, horizontal and vertical bonds represent deleted and inserted elements, respectively. 
A fraction U of the bonds along the evolution path are native bonds representing the native 
pairs (Qi = Qj). The fidelity of an alignment is then simply the fraction of native bonds 
that are also part of the corresponding alignment path r(t), see Fig. 1(c). 

Alignment morphology 

Alignment algorithms are designed to trace the mutual correlations between sequences. As 
it becomes clear from Figs. 2, the presence of such correlations affects both the morphology 
of the optimal alignment path r (t) and the associated score statistics. Fig. 2(a) shows the 
path r (t) for a pair of mutually uncorrelated random sequences. This path is seen to be 
intrinsically rough; i.e., the displacement has large variations. This "wandering" is caused 
by random agglomerations of matches in different regions of the alignment grid. Fig. 2(b) 
shows the corresponding score landscape S (r,t) for a given value of t. The maximum score 
value occurs at the point ro(t) and is seen to be not very pronounced; near-optimal score 
values occur also at distant points such as 77. The statistics of alignment paths and scores 
for uncorrelated sequences are discussed in detail in Section 3 below. 

The optimal alignment path for a pair of mutually correlated sequences (obtained from 
the evolution process described above) behaves quite differently, as shown in Fig. 2(c). Its 
wandering is essentially restricted to a "corridor" of finite width centered around around 
the evolution path R(t). In this way, the path r (t) covers a finite fraction T of the native 
bonds. The corresponding score landscape is shown in Fig. 2(d). The maximum at ro(t) 
is now very pronounced; all paths ending at points distant from r (t) have a substantially 
lower score than the optimal path. The alignment statistics of mutually correlated sequence 
pairs is described in Section 4. 

The morphology of the optimal alignment path depends strongly on the choice of the 
scoring parameter 7. As an example, Fig. 3 shows the optimal paths r (t) (dashed lines) 
for the same pair of correlated sequences with the same underlying evolution path R(t) (the 
solid line), and for three different values of 7: At small 7, the path r (t) follows the evolution 
path only on large scales. On small scales, variations in the displacement r (t) are seen to 
be larger than those of R(t) (Fig. 3(a)). The intrinsic roughness of the optimal alignment 
path limits its overlap with the evolution path, hence suppressing the fidelity. The fidelity is 
the highest at some intermediate value 7*, where the alignment path follows the target path 
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Fig. 2: (a) The optimal alignment path ro(i) and (b) a slice of the score landscape S(r,t = 4000) 
for a pair of mutually uncorrelated random sequences. The score maximum is at ro, which defines 
the endpoint ro = ro(t = 4000) of the optimal path. Similar score values occur also at distant 
points such as T\. (c) The paths ro(i) (dashed line), R(t) (solid line) and (d) the score landscape 
So(r) at t = 4000 for a pair of sequences with mutual correlations. The score maximum at ro is 
now pronounced; all distant points r have a substantially lower score. Hence the fluctuations of 
the alignment path ro(t) are confined to a corridor around the evolution path R(t). 
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Fig. 3: Optimal alignment paths ro(t) for the same pair of correlated sequences and three different 
values of 7. The evolution path R(t) (solid lines) is the same in all three cases, while the optimal 
alignment paths ro(t) (dashed lines) differ, (a) Random fluctuation regime (7 < 7*). The path 
ro(t) has strong fluctuations since the gap cost is low. (b) Optimal alignment parameter 7 = 7*. 
The fluctuations of the paths ro(t) and R(t) are of the same order of magnitude, (c) Shortcut 
regime (7 > 7*). At high gap cost, the fluctuations of R(t) are dominant, while ro(t) contains large 
straight segments. 
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most closely (Fig. 2(b)). At large 7, the alignment path contains large straight segments 
(Fig. 2(c)), which again reduces the fidelity. 

A qualitative understanding of this parameter dependence may be gained from an analogy 
to random walks, regarding 7"o(£) as the trajectory of walker following a curvy path R(t). 
The intrinsic properties of the walker are parametrized^ by 7. For small 7, the the walker 
is drunk and cannot follow the path R(t) without meandering to its left and right. This is 
the regime of Fig. 2(a), which we call the random fluctuation regime. For large values of 7, 
on the other hand, the walker is lazy and bypasses the larger turns of the path R(t); this is 
the shortcut regime (Fig. 2(c)). From this analogy, it becomes plausible that a walker who 
is neither too drunk nor too lazy will follow the path R(t) most closely and thereby achieve 
the highest fidelity (Fig. 2(b)). Such a criterion for the optimal parameter 7* will indeed 
emerge from the quantitative theory described in the remainder of this paper. 

3 Alignment of Uncorrelated Sequences 

A statistical theory of alignment can hardly predict the optimal alignment for a specific 
pair of sequences. What can be characterized are quantities averaged over realizations of 
the evolution process for given parameters U and q. It will be shown, however, that these 
ensemble averages are also relevant for the alignment statistics of single pairs of "typical" 
sequences provided they are sufficiently long. The approach is different from the extremal 
statistics of the score distribution that has been used to assess the significance of alignment 
results (Karlin and Altschul, 1990, 1993). 

In the absence of mutual correlations (i.e., for U — 0), the statistics of alignments is 
determined by a balance between the loss in score due to gaps and the gain in score due 
to an excess number of random matches. As discussed by Hwa and Lassig (1996), the 
corresponding alignment paths belong to a class of systems known in statistical mechanics 
as directed polymers in a random medium []. The statistical properties of directed polymers 
have been characterized in detail (Kardar, 1987; Huse and Fisher, 1991; Hwa and Fisher, 
1994). We now recall the main results and give numerical evidence of their applicability to 
sequence alignment. 

4 In statistical mechanics, 7 is the effective line tension of the fluctuating path r (t). 

5 This is also known as the problem of first passage percolation. A detailed mathematical analysis of 
the scaling laws presented below can be found in recent works by Licea et al. (1994, 1996). The main 
difference to alignment statistics is the number of independent random variables on a grid of size N x N . 
For directed polymers and first passage percolation, this number is of order N 2 ; for alignments of random 
sequences, it is only of order N (see also Arratia and Waterman, 1994). This difference is, however, irrelevant 
for the asymptotic scaling behavior (Hwa and Lassig, unpublished). A detailed heuristic discussion of this 
equivalence in the context of a number of closely related problems is given by Cule and Hwa (1997). 
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Displacement and score statistics 

The displacement Ar (t2 — ti) = ^0(^2) — ^0(^1) of the optimal alignment path between two 
arbitrary points ti and £2 is found to obey the statistical scaling law 

(Ar (t)) 2 ^A 2 ( 7 ) |t| 4/3 , (5) 

the overbar denoting the average over an ensemble of mutually uncorrelated sequence pairs. 
Eq. (0) is an asymptotic law valid for (Ar (£)) 2 ^> 1, i.e., for t ^> £0(7) = A~ 3 / 2 (7). It 
says that the exponent 4/3 is a very robust feature of the optimal alignment of uncorrelated 
random sequences, independent of the scoring parameter(s) or even scoring schemes used. 
A large gap cost efficiently suppresses the displacement only for the limited range of scales 
t < toil)- On larger scales, the cost of gaps is always outweighed by the gain in score 
from regions of the alignment grid with an excess number of random matches, leading to 
the power law (|5|) with a "universal" exponent. The dependence of the roughness (Ar (t)) 2 
on the scoring parameters (7 in this case) is contained entirely in the amplitude ^(7); this 
dependence is discussed below. The ensemble average ([5]) also describes the displacement 
auto- correlation function of the optimal path for a single sequence pair, defined as an average 
over initial points t\ in an interval T ^> t, 

(Ar (t)) 2 = T- 1 £ (r (tx + 1) - r (ti)) 2 . (6) 
ti=i 

The large displacement fluctuations of the optimal alignment path r (t) are accompanied 
by large variations in its score. For an ensemble of mutually uncorrelated sequences, the 
score average is asymptotically linear in the length N, 

SojNTf) - Ml) N (7) 

for N 1, with a monotonically decreasing coefficient function #0(7). However, the variance 
of optimal score is described by a nontrivial power law 

(AS (iV,7)) 2 ^ 2 (7)iV 2 / 3 (8) 

which is valid in the asymptotic regime iV ^> t (7). The dependence on the alignment 
parameters is again only in the amplitude -8(7), while the exponent 2/3 is universal. The 
ensemble average can be obtained (up to a 7-independent proportionality factor) from a 
single pair of sequences as average in the score landscape So(r,t) over a sufficiently long 
interval 77 < r < 77 + R, 

n+R-i ( n+R-l \ 2 

(AiS'q^, 7)) 2 ~ RT 1 £ S 2 (r,t)- [R- 1 £ 5b(r,t) , (9) 

r=T\ \ r=r\ J 

see Appendix B. We have verified the scaling laws (|5|) and (|8|) numerically for a range of 7 
values; see Figs. 4. The asymptotic behavior is found to set in rather quickly for t > £0(7)- 
The same scaling has been found for a pair of unrelated cDNA sequences (see also Fig. 4), 
which justifies our modeling of individual sequences as Markov chains. A more comprehensive 
study of correlated and uncorrelated cDNA sequences will be presented elsewhere. 
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(AS (t,y)) 2 




Fig. 4: (a) Displacement fluctuations (Aro(i)) 2 of the optimal alignment for several values of 7. 
Averages over an ensemble of 200 mutually uncorrelated sequence pairs are marked by lines, auto- 
correlation functions for a single sequence pair of length N = 10 5 by squares, (b) Score fluctuations 
(AS'o(t)) 2 obtained from the score landscape (|29l ) by Eq. (^) for several values of 7. (c) Displacement 
auto-correlation function and (d) score fluctuations for a pair of unrelated cDNA seqences (P.lividius 
cDNA for COLL2alpha gene (Exposito et al, 1995) and Drosophila melanogaster (cDNAl) protein 
4.1 homologue (coracle) mRNA, complete cds. (Fehon et al, 1994)). The straight lines indicate 
the expected power laws given by Eqs. (||) and (g). 
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Confinement and tilt energies 



A related set of power laws govern the change in the average optimal score So if the alignment 
paths are subject to constraints. For example, the constraint — r c /2 < ro(t) < r c /2 artificially 
confines the paths to a strip of width r c on the alignment grid. This decreases the optimal 
score So since the path ra(t) is cut off from random agglomerations of matches outside the 
strip. For long sequences, this confinement cost becomes proportional to N, and the average 
confinement cost per unit of t is 

SE c (r c - 7 ) ^ r " N ^-^ N ^ <0. (10) 

It obeys the scaling law 

^ C (r c ; 7 )~-C( 7 )r c - 1 , (11) 

with all the parameter dependence contained in the prefactor C ('-/). This relation is valid 
in the asymptotic regime of strong confinement, i.e., for sequences long enough that their 
unconstrained mean square fluctuations (Ar ) 2 (iV) exceed the scale r 2 . According to (Q), 
this condition is satisfied for iV 3> r 3 / 2 t (7). 

In a similar way, the alignment may be constrained by restricting both ends of the 
alignment path to given values of r, for example, r(0) = and r(N) = r . This forces an 
average tilt 9 = r Q /N upon the alignment path, thereby increasing its number of gaps and 
decreasing its number of matches. The resulting tilt cost is again proportional to N for long 
sequences, and the average tilt cost per unit of t, 

w( , ;7)E 5w_5Mi <0 (12) 

is given by 

8E t {B^) ~ -D^ . (13) 

This power law is valid for long sequences (N ^> to (7)) an d small tilt angles {6 2 < to 2 (7)). 
The scaling form of the confinement and tilt energies has been verified numerically, see Fig. 5. 



Parameter dependence 

The scaling laws (§), (|TTD and fll3|) all have the same structure: they are power laws 
with universal exponents and parameter-dependent amplitudes. The scaling theory predicts 
not only the values of the exponents but also universal relations between the amplitudes. 
We have 

A 3/4 ( 7 ) oc £- 3 ( 7 ) oc C( 7 ) oc ZT 1 / 3 ^) , (14) 

the tildes indicating proportionality factors independent of 7 and of order 1. The amplitudes 
are monotonic functions of 7, which become independent of 7 for 7 < 7 . Their asymptotic 
behavior for large 7 can be calculated (Hwa and Lassig, 1996), resulting in C(y) ~ 7 _1 . 
Indeed, we find this amplitude to be well approximated by the form 

C(j) ~ (7 + const. ) _1 (15) 
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Fig. 5: (a) Confinement cost 5E c (r c ;j) for optimal alignment paths in a corridor — r c < ro(t) < r c , 
taken from an ensemble of 200 mutually uncorrelated random sequences, (b) Scaled tilt cost 
5E t (6;~f) /to(j) as a function of the scaled tilt Otofa), for the same ensemble of sequences. The 
curves describe asymptotic power laws with universal exponents and and 7-dependent amplitudes, 
as given by Eqs. (|i~l|) and (13). 
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in the entire interval 7 > 7o- Our numerical data verifying Eqs. flT4]) and fll^D are shown 
in Fig. 6. We also show numerical results for the the average score per unit of t, i.e., the 
function -^0(7) in Ecj- (0). We find -So (7) ~ C(j) approximately. 

4 Alignment of Correlated Sequences 

Displacement fluctuations of the evolution path 

As discussed in Section 2, the mutual correlations between sequences are encoded in their 
evolution path, which is represented by the evolution path R(t) on the alignment grid. This 
path has displacement fluctuations due to the random distribution of insertions and deletions, 
see Figs. 2(c) and 3. However, the statistics of these fluctuations is different from that of 
the alignment paths discussed in the previous Section. Since the evolution is modeled as a 
Markov process, the displacement AR(ti — £2) = R{t\) — Rfo) has the mean square 

AR(t) 2 = q\t\ (16) 

characteristic of a Markov random walk, with q given by Eq. (|2|). The overbar denotes an 
ensemble average over realizations of the evolution process with given values of U and q. 
The ensemble average ([16|) equals the auto-correlation function of a single sufficiently long 
evolution path R(t) defined in analogy to @. 

We may compare the fluctuations AR(t) 2 of the evolution path for correlated sequences 
with the fluctuations Ar (t) 2 of the optimal alignment path for uncorrelated sequences 
(Drasdo et. a!., 1997). This defines a scale t, where these fluctuations are of the same 
order of magnitude: (AR(t)) 2 = (Ar (t)) 2 = r 2 . From Eqs. ([5]) and (|16D , we obtain 

f (7, q) = q 3 /A 6 (l) , f (7, q) = q'/A 3 ^) . (17) 

We call the scales ( |TTD roughness matching scales. For \t\ < £(7, q), the displacement of 
the evolution path exceeds that of the optimal alignment path, while for \t\ > t(j,q), the 
displacement of the alignment path becomes dominant. 

Scaling theory for correlated sequences 

For sequences with mutual correlations (i.e., U > 0), the morphology of the optimal align- 
ment path ro(£) and the score statistics are more complicated than for uncorrelated sequences 
since in addition to the random matches, there are now the native matches along the evolu- 
tion path R(t). Due to these competing score contributions, the problem seems to be beyond 
the means of any rigorous mathematical approach. However, it turns out that the statistics 
of weakly correlated sequences is described with remarkable accuracy by the scaling theory 
developed in the previous Section. 

Consider a pair of weakly correlated sequences of length N ^> 1 with an optimal alignment 
of finite fidelity T > at a given value of 7. Since the optimal alignment path ro(t) and the 
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evolution path R(t) have a finite fraction of common bonds, the displacement fluctuations 
of To (£) remain confined to a "corridor" centered around the path R(t) (see Fig. 2(c)). The 
width r c of this corridor can be defined by the mean square relative displacement 

rl = (r„(t) - R(tW , (18) 

which can again be understood as an ensemble average or equivalently as an average over 
t for a single pair of long sequences. To see this equivalence, we note that by Eq. @, 
the width r c defines a corresponding scale in t direction, t c = r^t^). One can show 
that t c is a correlation length; i.e., points on the alignment path with \t 2 — ti\ > t c are 
essentially uncorrelated. Averaging over uncorrelated regions of the alignment path generates 
the ensemble underlying Eq. ([18]) even for a single pair of sequences if they are sufficiently 
long, i.e., N, N' > t c . 

By confining the alignment path ro(i) to a corridor, mutual correlations act as a constraint 
on its displacement fluctuations. This leads to a score cost as discussed in the Section 3. 
However, the constraint cost must be outweighed by the gain in score due to the native 
matches, resulting in a net score gain per unit of t, 



Wt , T)s !£M! ! M > o, a.) 



with Sq(N, 7) denoting the average score of uncorrelated sequences. 

We now calculate the confinement length r c (U,q,j) and the score gain 5E(U,q,j) in a 
variational approach, treating r c as an independent variable to be determined a posteriori 
from an extremal condition. We stress that this approach is not exact; the main approxima- 
tion consists in treating r and t as continuous variables. 

The constraint cost per unit of t imposed by the evolution path R(t) involves terms of 
the form discussed in the previous Section: (i) If the path r*o(t) is confined to a corridor of 
width r c around the fluctuating path R(t), the tangent to r (t) has a typical tilt 6 ~ q/r c 
with respect to the diagonal of the alignment grid, implying a tilt cost 

^(r c ;g, 7 )~- J D(7)(-) 2 • (20) 

\r c j 

(ii) The confinement cost to an untilted corridor of width r c is 5E C = C/r c . The tilt reduces 
the effective width of the corridor so that the confinement cost takes the form 

5E c (r c ; q, 7) C( 7 ) . (21) 

r c 

On the other hand, the gain in score per unit of t due to the native matches is simply 
SE n — UJ 7 , as it is clear from the definition of the fidelity T . We need to express JF in terms 
of r c . Naively one would expect T ~ l/r c . A detailed analysis shows that this is correct up 
to a logarithmic correction (Hwa and Nattermann, 1995, Kinzelbach and Lassig, 1995, Hwa 
and Lassig, 1996) leading to 

6E n (r c ; U)~U i±^£ . (22) 
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The net score gain is the sum of these contributions, 5E = 5E C + 5E t + 5E n . The resulting 
equation can be simplified by using the scaled variables x = C/U, y = q/U 2 , and SS = SE/U. 
Absorbing all unknown proportionality factors into their definition, we obtain the scaled 
energy gain 

S£{r c ;x,y) = 1 + ~ ~ + ■ ( 23 ) 

r c xV x z J r A c r c 



Maximizing (p3|) then determines the actual value of r c (x,y) = r c (U, q, 7): 

SS(x, y) = max SS(r c ; x, y) . (24) 

Fig. 7 shows numerical data for the fidelity T{x,y) = jF(r c (x, y); x, y) and score gain 
5£(x, y) obtained from single sequence pairs with various values of U, q and 7. As expected 
from this scaling theory, the data points for different parameter sets (U, q, 7) corresponding 
to the same (2, y) collapse approximately. This data collapse will be useful for similarity 
detection. 



Alignment parameter optimization 

The numerical fidelity and score patterns of Fig. 7 have clear maxima J-*{y) = J r (x*(y),y) 
and SS s (y) = 5£(x s (y),y), attained at points x*(y) and x s (y). Fig. 7 also shows the loci 
of these maxima, (x*(y), T*(y)) and (x s (y), 5£ s (y)), as well as the limit curves ^(x, 0) and 
5S s (x, 0) obtained from the scaling theory (i.e., from Eqs. ( ^3l) and (|24]) solved numerically). 
The theory is seen to predict the functional form of these curves in a reasonable way, except 
in the region JF ~ 1 (i.e., r c ~ 1) where the continuum approximation valid in the regime of 
weak similarity breaks down. (The unknown 7-independent proportionality factors for the 
scaling variables x, y, 5S and for T have been determined by fits to the data.) 

The functions x*(y), x s (y), J-*{y), and 5£ s {y) shown in Figs. 8 (a) and (b) encode in an 
efficient way the dependence of the fidelity and score maxima on the alignment parameter 
and on the evolution parameters. Furthermore, it follows from the (numerical) solution 
of (|23| ) and (|24| ) that the confinement length r*(y) = r c (x*(y),y) at the point of maximal 
fidelity satisfies the approximate relation 

r* e {y)~r{x*(y),y) (25) 

in the biologically interesting regime of small q and moderate U (0 < y < 5). The optimal 
confinement length is thus proportional to the roughness matching scale (|17D at that point. 
Hence, this scaling theory is in accordance with the qualitative picture of Section 2: At 
x*(y), the fluctuations of the optimal alignment path ro(t) just match those of the evolution 
path R(t) (see Fig. 2(b)). The shortcut regime (Fig. 2(c)) corresponds to the ascending 
branch (x < x*(y)) of the fidelity curves in Fig. 7(a), while the random fluctuation regime 
(Fig. 2(a)) corresponds to the descending branch (x > x*(y)). 
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Fig. 7: (a) Fidelity J r (x,y) and (b) score gain S£(x,y) obtained from single sequence pairs with 
various evolution parameters U, q and alignment parameters 7. The data for different (U, q, 7) 
corresponding to the same (x, y) collapse approximately, as predicted by the scaling theory. The 
lines are the theoretical loci of the maxima (x* (y) , T* (y)) (short-dashed), (x s (y),5£ s (y)) (long- 
dashed) and the theoretical limit curves jF(x,0), 5£(x,0) (solid). 
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Fig. 8: Alignments of maximal fidelity and of maximal score gain. Theoretical predictions for the 
curves (a) x*(y), x s (y) and (b) 5£ s (y), compared to numerical data obtained from fits to 

the curves of Fig. 7. 



Similarity detection 

The evolution process used in this paper is closely related to a more realistic process for 
the divergent evolution of two daughter sequences and from a closest common 
ancestor sequence Q. Modeling the two evolution paths as independent Markov processes 
with respective parameters U\, qi and U2,q2, one can show that the evolution path linking 
and is again a Markov process with parameters U = U-JJ 2 an d q = q\ + <h + 0(q 2 ). 

For practical alignments, however, the evolutionary parameters U and q are unknown. 
Since they enter the definition of the basic variables x and y, knowledge of the optimal pa- 
rameters x*(y) and x s (y) seems to be of little use for applications. However, these parameters 
can be reconstructed from alignment data, as we will now show for a specific example. 

Consider three sequences QW, and Q (3) related by the evolut ion tree of Fig. 9(a). 
The evolutionary distances Tj are defined in terms of the mutual similarity coefficients 
by 

-log U ij = T i + T j {i,j = 1,2,3). (26) 

We whish to determine t\,t 2 and r 3 from pairwise alignments of the sequences^. Fig. 9(b) 
shows the alignment data 8Eij as defined in Eq. dl9|) for each of these pairs, plotted as a 
function of C(j). To fit the data curve SEij(C) to the corresponding scaled score gain curve 
SSij(x) of Fig. 7(b), we have to divide both axes of the diagram by Uij. In this way, we 
can determine the a priori unknown factors U^, and hence the evolutionary distances Ti, see 
Fig. 9(b). For this example, we obtain U 12 « 0.54, U 13 w 0.43, U 23 w 0.415, and n w 0.22, 
t 2 ~ 0.33, r 3 fa 0.55, which is to be compared with the actual values T\ = 0.27, r 2 = 0.38, 
and r 3 = 0.61 used to produce the sequences. 

6 In this example, we use effective indel rates — log(l — gy) = r(r, + tj) with T = 0.2, but this choice is 
not crucial. 
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Fig. 9: (a) Evolution tree linking three sequences QW, Q< 2 \ and Q( 3 >. The sequences have evolu- 
tionary distances 77, T2, and T3 to the branching point of the tree, as defined by Eq. (p6|), and have 
lengths N\ ~ A2 ~ iV"3 ~ 5000. (b) Alignment data SE12, SE13 and 5E23 for pairwise alignments of 
the sequences at different values of 7, shown as a function of C(j). 5£\2, S£\3, and 8623 obtained 
by rescaling the raw alignment data by respective factors U\ 2 , U13, and U23 such that the max- 
ima of the rescaled curves fall on the theoretical locus (x s (y),5£ s (y)) (long-dashed curve, cf. Fig. 
7(b)). This determines the a priori unknown similarity coefficients Uij, and hence the evolutionary 
distances r». 

Finally, we construct the pairwise alignments of highest fidelity. From Fig. 8(a), they 
are seen to satisfy the approximate relation C?-/C?- = x*j/xL ~ 1.2 for 0.1 < y < 4. 
With the values Cf 2 w 0.23, Cf 3 w 0.225, and C| 3 « 0.254 read off from Fig. 9(b) and 
using Eq. ([15]) with the constants of Fig. 6, we obtain the optimal alignment parameters 
7* 2 ps 1.52, 7J3 ^ 1.59, 723 ~ 1.25. The scaled score maxima 5£f 2 ps 0.26, 5£f 3 ps 0.18, 
5£| 3 ^ 0.15 determine the expected fidelities T\ 2 ps 0.75, JF^ 3 ps 0.58, T23 ~ 0.52 as seen 
from Fig. 8(b). They are in good agreement with the actual maxima T\ 2 — 0-8, T\ z = 0.65, 
JF23 = 0.55 computed by comparing directly to the evolutionary paths. 



5 Discussion 

We have presented a statistical scaling theory for global gapped alignments. Alignments of 
mutually uncorrelated sequences are found to be governed by a number of universal scaling 
laws: ensemble averages such as the mean square displacement of the alignment path or 
the variance of the optimal score follow power laws whose exponents do not depend on the 
scoring parameters. The parameter dependence is contained entirely in the prefactors. This 
universality is comparable to the diffusion law describing a large variety of random walk 
processes on large scales, the only parameter dependence being the value of the diffusion 
constant. In contrast to diffusive random walks, however, we find optimal alignment paths 
to be strongly non-Markovian on all length scales due to random agglomerations of matches 
and mismatches. Hence, the exponents take nontrivial values. The scaling laws also govern 
the displacement statistics of a single alignment path r(t) and the associated statistics of 
partial scores, which makes these concepts applicable to individual alignment problems. 
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This scaling theory is also relevant for the statistics of mutually correlated sequence pairs. 
Two important quantities are the score gain over uncorrelated sequences and the alignment 
fidelity. Both quantities strongly depend on the evolutionary parameters linking the two 
sequences and on the alignment parameters. For a simple Markovian evolution model and 
for linear scoring functions, we have obtained a quantitative description of this parameter 
dependence. In particular, the alignment parameter of maximal fidelity turns out to be 
closely related to the parameter of maximal score gain, which makes it possible to construct 
the alignment of maximal fidelity from a systematic analysis of score data. Moreover, the 
underlying evolutionary parameters (the mutual similarity U and the effective indel rate q) 
can also be inferred from this analysis. 

It is important to understand inhowfar the results of this paper carry over to more refined 
algorithms for the alignment of realistic sequences. The universal scaling laws for uncorre- 
lated sequences should prove to be very robust under changes of the scoring function (such 
as scoring matrices distinguishing between transitions and transversions) as well as changes 
in the sequences (the number of different letters and their frequencies). As corroborated by 
preliminary numerical results, such changes reduce to a different parameter dependence of 
the amplitude functions A,B,C, and D. In particular, we find the universal scaling laws 
to be preserved for the alignment of bona fide uncorrelated cDNA sequences, which also 
validates the Markov model for single sequences. While not affecting the asymptotic uni- 
versality, some scoring functions (for example, systems with affine gap cost distinguishing 
between gap initiation and gap extension) may introduce intermediate regimes where the 
score and fidelity curves are modified. Nevertheless, the fidelity and the score gain remain 
key quantities of an alignment, and their optimal values are closely related. This makes it 
possible to construct optimal alignments on the basis of a statistical analysis of score data. 
This link and the underlying scaling theory are also crucial to the analysis of local alignment 
algorithms, as we have shown recently (Hwa and Lassig, 1998; Drasdo et al, 1998). 
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Appendix A: Evolution model 



The Markov process governing the evolution of a daughter sequence Q' from an ancestor 
sequence Q is specified by the flux diagram of Fig. 10. 






addX 
left of Qj 







Fig. 10: Flux diagram of the Markov evolution process. A realization generates a daughter se- 
quence Q' = {Q'j} from an ancestor sequence Q = {Qi}. The process is characterized by the 
insertion/deletion probability p and the substitution probability p. X denotes a random letter. 



The statistical properties of this Markov process are straightforward to compute. Using 
the notation t = i + j and R = j—i,we find R(t) to be a Gaussian random variable with 

Rjt) = , WItj = qt ? (27) 

where q is given by Eq. (0). This implies in particular that the length N' of the daughter 
sequence is also a Gaussian random variable with 

W = N , (TV 7 - iV) 2 = 2qN . (28) 



Appendix B: Alignment Algorithm 

The dynammic programming algorithm generates the score landscape S(r, t) for all grid 
points by the recursion relation 

f S(r-l,t-l)- 1 
S{r,t) = max{ S{r + 1, t - 1) - 7 } (29) 
{ S(r,t - 2) + s{r,t) 

with 



S M) = (V % r+t)/2 Z Q n {r - t)/2 (30) 

I -7^=1 11 Q(r+t)/2 T Q(r-t)/2 

This recursion relation is evaluated in a restricted alignment grid shown in Fig. 11, which 
limits the computing time to a value ~ T x L. The value of L is chosen according to the 
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Fig. 11: Restricted alignment grid (bounded by thick lines) used for the evaluation of the recursion 
relation (p9|). With initial condition (i), the alignment paths are pinned at their initial point (dot) 
defined to be at t = 0. With initial condition (ii), the score is prescribed along the dashed line 
defined to be at t = 0, namely S(r,t = 0) = 0. 

specific application (see below). Along the strip, we use periodic boundary conditions, i.e., 
S(r — L/2,t) = S(r + L/2,t). (Similar results are obtained for open boundary condition.) 
Two types of initial conditions are used depending on the specific application: (i) r(t — 0) = 
(with t = i+j) or (ii) S(t = 0) = (with t = i+j — L/2). Evaluation of the recursion relation 
stops at t — T. Hence, the optimal alignment path r (t) ends at the point r = r (T) given 
by S(r , T) = min r S (r, T). If the score values S (r, T) are degenerate for different values of 
r, one of them is chosen at random. The entire path r (t) is then found by backtracking it 
from its endpoint tq. Degeneracies are again resolved by a random choices. This is justified 
since degenerate optimal paths have a typical distance of order 1 only. For more precise 
formulations of this "macroscopic uniqueness", see Fisher and Huse (1991), Hwa and Fisher 
(1994), Kinzelbach and Lassig (1995). 

To compute the unconstrained fluctuations of optimal alignments for uncorrelated se- 
quences, L has to be sufficiently large so that the result becomes independent of it: L 2 ^> 
(Aro(T)) 2 . The displacement fluctuations (Aro(t)) 2 and the tilt cost 5E t {6) are evaluated 
with the pinned initial condition (i); in the latter case, also the endpoint ro = 9T is pinned. 
The score variance (§) is computed with the initial condition (ii). 

On the other hand, the confinement cost 5E c (r c ) is determined by choosing L = r c and 
T ^> L 3//2 to(7) so that the result becomes independent of T and of the initial condition. 

For correlated sequences, we choose L again large enough so that the result becomes 
independent of it: L 2 ^> (AR(T)) 2 + r 2 . For T 3> t c , quantities defined per unit of t such as 
J- and SE will also become independent of the initial condition. 
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